(https://www.kaggle.com/datasets/sudalairajkumar/daily-temperature-of-major-cities)
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import numpy as np
import pandas as pd
import plotly.express as px
import plotly.io as pio
# pio.templates.default = "simple_white"
pio.renderers.default='notebook'
np.random.seed(0)
The analysis I carried out is as follows:
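def load_data(filename):
    # Parse the Date column as day-first dates while reading the CSV.
    temp_df = pd.read_csv(filename, parse_dates=['Date'], dayfirst=True)
    temp_df = temp_df.drop_duplicates().dropna()
    # Keep only rows with Temp >= -15; lower values are treated as invalid readings.
    temp_df = temp_df[temp_df["Temp"] >= -15].copy()
    dates = pd.to_datetime(temp_df["Date"], errors='coerce').dt.to_period('h')
    temp_df["DayOfYear"] = dates.dt.day_of_year
    # Store Year as a string so that plotly treats it as a discrete color category.
    temp_df["Year"] = temp_df["Year"].astype(str)
    return temp_df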
# Load city daily temperature dataset and preprocess data.
df = load_data("./City_Temperature.csv")
df
| | Country | City | Date | Year | Month | Day | Temp | DayOfYear |
|---|---|---|---|---|---|---|---|---|
| 0 | South Africa | Capetown | 1995-01-01 | 1995 | 1 | 1 | 19.333333 | 1 |
| 1 | South Africa | Capetown | 1995-01-02 | 1995 | 1 | 2 | 19.888889 | 2 |
| 2 | South Africa | Capetown | 1995-01-03 | 1995 | 1 | 3 | 19.388889 | 3 |
| 3 | South Africa | Capetown | 1995-01-04 | 1995 | 1 | 4 | 20.833333 | 4 |
| 4 | South Africa | Capetown | 1995-01-05 | 1995 | 1 | 5 | 21.444444 | 5 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 32434 | Jordan | Amman | 2020-05-09 | 2020 | 5 | 9 | 17.555556 | 130 |
| 32435 | Jordan | Amman | 2020-05-10 | 2020 | 5 | 10 | 17.055556 | 131 |
| 32436 | Jordan | Amman | 2020-05-11 | 2020 | 5 | 11 | 20.666667 | 132 |
| 32437 | Jordan | Amman | 2020-05-12 | 2020 | 5 | 12 | 24.444444 | 133 |
| 32438 | Jordan | Amman | 2020-05-13 | 2020 | 5 | 13 | 18.500000 | 134 |
31898 rows × 8 columns
Let us subset the dataset to contain samples only from Israel, so we can investigate how the average daily temperature (the Temp column) changes as a function of DayOfYear.
df_israel = df[df["Country"] == "Israel"]
df_israel_avg = df_israel.groupby(["Year", "DayOfYear"], as_index=False)["Temp"].mean()
fig = px.scatter(df_israel_avg, x="DayOfYear", y="Temp", color="Year",
                 title="Figure (1) Average daily temperature as a function of the DayOfYear")
fig.show()
Based on this plot, one can note that the data behaves quite similarly across the different years, and that it has the shape of a wave, with the highest temperatures around day ~200 of the year.
Since the curve has about three extreme points, a low-degree polynomial, of degree 3 or 4, might be suitable for this data (a polynomial of degree k has at most k - 1 turning points, so degree 4 is the smallest that can produce three).
Now we will group the samples by Month and create a bar plot showing for each month the std of the daily temperatures.
df_israel_months = df_israel.groupby(["Month"], as_index=False)["Temp"].std()
fig = px.bar(df_israel_months, x='Month', y='Temp',
             labels={'Temp': 'std'},
             title="Figure (2) Standard Deviation Of The Daily Temperatures Over Months")
fig.show()
Suppose we fit a polynomial model (with the correct degree) over data sampled uniformly at random from this dataset, and then use it to predict temperatures from random days across the year.
Based on this graph, I would expect that this model won't succeed equally in prediction across all months. In months with low variance (June [6] to September [9]), I would expect the model to perform better and to fit closer to reality. I assume that it will do worst in March and April (months 3 and 4), which are months with high variability.
This is under the assumption that the test set is generated from the same distribution as the train set.
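One way to check this expectation is to group the test error by month. The snippet below is only a minimal sketch on synthetic data, not on the real dataset: the seasonal curve, the rough month index, and the assumption that March and April are noisier are all hypothetical choices made for illustration, and the day of year is rescaled before fitting purely for numerical conditioning.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the Israel subset (NOT the real data).
day = rng.integers(1, 366, size=4000)        # day of year, 1..365
month = (day - 1) // 31 + 1                  # rough month index 1..12 (hypothetical)
# Assumption: March-April readings are noisier than the rest of the year.
noise_sd = np.where(np.isin(month, [3, 4]), 4.0, 1.5)
temp = 20 - 8 * np.cos(2 * np.pi * (day - 20) / 365) + rng.normal(0, noise_sd)

# Random 75/25 train/test split.
idx = rng.permutation(day.size)
test_idx, train_idx = idx[:1000], idx[1000:]

# Fit a degree-5 polynomial on the training part (day rescaled for conditioning).
x = day / 365.0
model = np.poly1d(np.polyfit(x[train_idx], temp[train_idx], 5))

# Mean squared error of the test predictions, grouped by month.
sq_err = (model(x[test_idx]) - temp[test_idx]) ** 2
test_month = month[test_idx]
per_month_mse = {m: float(sq_err[test_month == m].mean()) for m in range(1, 13)}

for m, err in per_month_mse.items():
    print(f"month {m:2d}: test MSE = {err:.2f}")
```

On the real data, the same grouping could be applied to the test split of df_israel, using its Month column instead of the rough index above.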
Now, back to the full dataset: we will group the samples by Country and Month, and calculate the average and standard deviation of the temperature.
We will then draw a line plot of the average monthly temperature, with error bars, color coded by country.
# Named aggregation: the dict-renaming form of agg on a single column is rejected by recent pandas versions.
df_3 = (df.groupby(["Country", "Month"])["Temp"]
          .agg(avgTemp='mean', std='std')
          .reset_index())
fig = px.line(df_3, x="Month", y="avgTemp", color="Country", error_y="std",
              labels={'avgTemp': 'Average Temperature'},
              title="Figure (3) Average monthly temperature as a function of the Month")
fig.show()
Based on the graph above, one can note that not all countries share the same pattern, in terms of the distribution of the average monthly temperature as a function of the month.
According to this plot, we expect that a model fitted on the Israel data only will perform very well on Jordan, whereas it likely won't work well on South Africa or The Netherlands. This is because South Africa's seasonal trend is opposite to that of the other three countries (e.g. months 6-9 are relatively hot in Israel but are the cold period in South Africa). The Netherlands' avgTemp, on the other hand, is quite far from Israel's values: its distribution has a similar shape to that of Israel, but is roughly 9 degrees lower at any time of the year. Thus, for The Netherlands I could reuse the model fitted for Israel by simply adjusting the value of the intercept.
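The intercept adjustment mentioned above can be sketched as follows. This is an illustration on synthetic curves, not on the real data: the seasonal shape, the noise level, and the 9-degree gap are assumed values, and the shift is estimated as the mean residual on the target country, which requires at least some observations from that country.

```python
import numpy as np

rng = np.random.default_rng(0)
day = np.arange(1, 366)
x = day / 365.0                                   # rescaled for conditioning

# Synthetic curves (NOT real data): same shape, the target is ~9 degrees lower.
clean = 20 - 8 * np.cos(2 * np.pi * (day - 20) / 365)
source_temp = clean + rng.normal(0, 1.5, day.size)       # "Israel-like"
target_temp = clean - 9 + rng.normal(0, 1.5, day.size)   # "Netherlands-like"

# Fit on the source country only.
model = np.poly1d(np.polyfit(x, source_temp, 5))
pred = model(x)
mse_before = float(np.mean((pred - target_temp) ** 2))

# The constant shift that minimises the squared error is the mean residual.
offset = float(np.mean(target_temp - pred))
mse_after = float(np.mean((pred + offset - target_temp) ** 2))

print(f"estimated shift: {offset:.2f}")
print(f"MSE before: {mse_before:.2f}, after: {mse_after:.2f}")
```

Adding the shift to the predictions is equivalent to changing only the constant term of the polynomial, so the fitted shape stays untouched.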
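Over the subset containing observations only from Israel, we will do the following: randomly split the data into a training set (75%) and a test set (25%); then, for every polynomial degree k from 1 to 10, fit a polynomial model of degree k on the training set and record its mean squared error on the test set.
Then we will create a bar plot showing the test error recorded for each value of k, in order to find which value of k best fits the data.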
train_X, test_X, train_y, test_y = train_test_split(df_israel["DayOfYear"], df_israel["Temp"], test_size=0.25)
losses = {'k': [], 'test_error': [], 'test_error_rounded': []}
for k in range(1, 11):
    # Fit a degree-k polynomial on the training set and evaluate it on the test set.
    z = np.poly1d(np.polyfit(train_X.values, train_y, k))
    pred_y = z(test_X.values)
    error = mean_squared_error(test_y, pred_y)
    losses['k'].append(k)
    losses['test_error'].append(error)
    losses['test_error_rounded'].append(round(error, 2))
    # print(f"Degree k={k}, test error: {round(error, 2)}")
fig = px.bar(losses, x='k', y='test_error', text='test_error_rounded',
             title="Figure (4) Test Error as a function of Polynomial Degree")
fig.show()
Based on this, I would choose k=5 as the value that best fits and describes the data: it gives the lowest test error, and above this value the model appears to overfit.
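To make the overfitting argument easier to inspect, it can help to look at the training error next to the test error for each degree. The following is a minimal sketch on synthetic seasonal data (an assumed cosine curve with Gaussian noise, not the real dataset); the day of year is rescaled to [-1, 1], which is my own choice to keep the high-degree fits numerically stable.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic seasonal data (NOT the real dataset).
day = rng.integers(1, 366, size=2000)
x = 2 * (day - 1) / 364 - 1                       # rescale day of year to [-1, 1]
temp = 20 - 8 * np.cos(2 * np.pi * (day - 20) / 365) + rng.normal(0, 2.0, day.size)

# Random 75/25 train/test split.
idx = rng.permutation(day.size)
n_test = day.size // 4
test_idx, train_idx = idx[:n_test], idx[n_test:]

train_mse, test_mse = [], []
for k in range(1, 11):
    model = np.poly1d(np.polyfit(x[train_idx], temp[train_idx], k))
    train_mse.append(float(np.mean((model(x[train_idx]) - temp[train_idx]) ** 2)))
    test_mse.append(float(np.mean((model(x[test_idx]) - temp[test_idx]) ** 2)))

best_k = int(np.argmin(test_mse)) + 1             # degrees start at 1
for k, (tr, te) in enumerate(zip(train_mse, test_mse), start=1):
    print(f"k={k:2d}  train MSE={tr:.3f}  test MSE={te:.3f}")
print("degree with the lowest test error:", best_k)
```

The training error can only go down as k grows, since each higher-degree model contains the lower-degree ones. A test error that stops improving, or rises, while the training error keeps falling is the usual sign of overfitting.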
Now we will fit a model over the entire subset of records from Israel, using the degree k=5 chosen above, and create a bar plot showing the model's error over each of the other countries.
# Fit a degree-5 polynomial over all Israel records.
model = np.poly1d(np.polyfit(df_israel["DayOfYear"].values, df_israel["Temp"], 5))
countries = ["South Africa", "Jordan", "The Netherlands"]
losses = {'Country': [], 'test_error': []}
for c in countries:
    # Evaluate the Israel-fitted model on every record of the current country.
    df_cur_country = df[df["Country"] == c]
    losses['Country'].append(c)
    error = mean_squared_error(df_cur_country["Temp"], model(df_cur_country["DayOfYear"].values))
    losses['test_error'].append(error)
fig = px.bar(losses, x='Country', y='test_error',
             title="Figure (5) Error of the Israel-Fitted Model Over Other Countries")
fig.show()
As we expected, the model fitted over the subset of observations from Israel performed best on Jordan, and in general it performs worse on data from the other countries. As we have seen in Figure 3, the distribution of temperatures in Jordan resembles that of Israel, which explains why, out of the three countries, the model performed best on Jordan.
The distributions of South Africa and The Netherlands were further from that of Israel, and therefore the fitted model performed poorly on them.
Although the distribution of the temperature data from The Netherlands has a very similar shape to that of Israel, while the distribution of the observations from South Africa is very different, the model performed better on South Africa. This is probably because, on average, the observations from Israel are closer to those of South Africa. Hence, although the model does not correctly mimic the distribution of observations from South Africa, the errors are still smaller than in the case of observations from The Netherlands.
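The argument above can be made concrete by splitting the mean squared error into the squared average error (a constant offset) plus the variance of the error (a shape mismatch). The sketch below uses idealised, noise-free curves with made-up amplitudes and means, so the numbers are illustrative only and do not reproduce the values in Figure 5.

```python
import numpy as np

day = np.arange(1, 366)
season = np.cos(2 * np.pi * (day - 20) / 365)

# Hypothetical, noise-free curves (NOT real data).
model_pred = 20 - 8 * season      # stand-in for the Israel-fitted model
shifted = model_pred - 9          # same shape, 9 degrees colder
inverted = 18 + 4 * season        # opposite season, similar yearly mean

def decompose(pred, obs):
    """Return (MSE, squared mean error, variance of the error)."""
    err = pred - obs
    return float(np.mean(err ** 2)), float(np.mean(err) ** 2), float(np.var(err))

mse_shifted, offset_sq_shifted, spread_shifted = decompose(model_pred, shifted)
mse_inverted, offset_sq_inverted, spread_inverted = decompose(model_pred, inverted)

print("shifted :", mse_shifted, offset_sq_shifted, spread_shifted)
print("inverted:", mse_inverted, offset_sq_inverted, spread_inverted)
```

With these assumed curves, the constant 9-degree gap alone contributes 81 to the error, which is more than the opposite-season curve accumulates from its shape mismatch. This matches the pattern described above, where a similar shape with a large offset can still score worse than a different shape with a closer average.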